enables ARM Thumb support #1122

Closed
wants to merge 25 commits into from

Contributor

@Phosphorus15 Phosphorus15 commented Jun 12, 2020

This PR presents a draft of a Core Theory/KB-based lifter for ARM Thumb instructions. It is mostly an incomplete skeleton that shows how the final lifter will be structured.
Some key features are still missing, including:

  • Heap & Stack memories representation
  • Control flow & PC register(treated as normal GPR for now) representation

Moreover, as issue #951 states, the ARM lifter and the Thumb lifter should eventually share the same state (more precisely, switch between each other), so how to integrate this lifter with the old ARM lifter remains an open problem. @ivg any idea how we can fix this?

Member

ivg commented Jun 12, 2020

Heap & Stack memories representation

Neither heap nor stack exist on the level of abstraction of instruction/lifter, so there is no need to model it.

Control flow & PC register(treated as normal GPR for now) representation

The lifter has to hide the PC register, so that ld PC shall be represented as jmp (PC+n), where n is the ARM PC offset (4 or 8 bytes IIRC).

... the way how to integrate this lifter with the old ARM lifter still remains a problem

In BAP 2.0 each program label (aka address) has its own architecture, therefore we need an analysis that will identify branches that switch the architecture. There are two caveats:

  1. Right now we just assign all addresses the same architecture (the one that is in the binary header), so we need to update this code and let lifters override the default arch.

  2. Architecture classification is undecidable at least in ARM/Thumb (a branch instruction that switches the mode could be unresolved/indirect) and the same two addresses can have both thumb and arm interpretation. In other words, it is an interesting task with many discoveries that are waiting for us.
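The first caveat can be pictured with a small stand-alone sketch. It does not use the BAP API: the `arch_table` type and the `mode_at` and `override` functions are hypothetical names, chosen only to show a per-address architecture with a header-provided default that an analysis may override once it resolves a mode-switching branch.

```ocaml
type mode = Arm | Thumb

module AddrMap = Map.Make (Int)

(* Hypothetical table: addresses without an explicit entry fall back
   to the architecture taken from the binary header. *)
type arch_table = { default : mode; overrides : mode AddrMap.t }

(* A table that assigns the header architecture to every address. *)
let of_header default = { default; overrides = AddrMap.empty }

(* Architecture of the program label at [addr]. *)
let mode_at table addr =
  match AddrMap.find_opt addr table.overrides with
  | Some m -> m
  | None -> table.default

(* Record that [addr] is known to be decoded in [mode], e.g. after a
   branch that switches the instruction set has been resolved. *)
let override table addr mode =
  { table with overrides = AddrMap.add addr mode table.overrides }
```

An unresolved indirect branch never reaches `override`, which is exactly where the second caveat applies.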

Contributor

XVilka commented Jun 12, 2020

  1. Architecture classification is undecidable at least in ARM/Thumb (a branch instruction that switches the mode could be unresolved/indirect) and the same two addresses can have both thumb and arm interpretation. In other words, it is an interesting task with many discoveries that are waiting for us.

This part can be solved with a superset assembler approach #944

Member

ivg commented Jun 12, 2020

  1. Architecture classification is undecidable at least in ARM/Thumb (a branch instruction that switches the mode could be unresolved/indirect) and the same two addresses can have both thumb and arm interpretation. In other words, it is an interesting task with many discoveries that are waiting for us.

This part can be solved with a superset assembler approach #944

It is not needed, as the disassembler in BAP 2.x is already speculative and superset. It is driven by the knowledge base, so it may disassemble all possible substrings in all supported architectures at the same time. The main question is performance: in general, we don't want the full superset, even with invalid chains pruned (which our disassembler does automatically). So the question is how to find the right balance between precision and performance. We don't really want to double the CFG of each ARM binary.
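To get a feel for the cost, here is a toy count of decoder start points, assuming ARM instructions start at 4-byte-aligned offsets and Thumb instructions at 2-byte-aligned offsets. The `candidates` function is a hypothetical illustration, not part of the BAP disassembler, and it ignores pruning and instruction lengths.

```ocaml
type mode = Arm | Thumb

(* All (offset, mode) pairs at which a decoder could be started in a
   region of [len] bytes: ARM every 4 bytes, Thumb every 2 bytes. *)
let candidates len =
  let rec go off acc =
    if off >= len then List.rev acc
    else
      let acc = if off mod 4 = 0 then (off, Arm) :: acc else acc in
      go (off + 2) ((off, Thumb) :: acc)
  in
  go 0 []

(* Number of start points when only the header architecture (ARM) is used. *)
let arm_only len =
  List.length (List.filter (fun (_, m) -> m = Arm) (candidates len))
```

For an 8-byte region this gives two ARM-only start points against six in the naive two-mode superset, which is the kind of growth the comment above wants to avoid.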

@ivg ivg marked this pull request as draft June 12, 2020 20:21
@ivg ivg changed the title [WIP] ARM Thumb support enables ARM Thumb support Jun 12, 2020
@ivg ivg added the arm-lifter label Jun 12, 2020
Contributor Author

Phosphorus15 commented Jun 14, 2020

Heap & Stack memories representation

Neither heap nor stack exist on the level of abstraction of instruction/lifter, so there is no need to model it.

But we need a linear memory representation for instructions like LDR or PUSH whatsoever, simply model it using Theory.Mem maybe?

Control flow & PC register(treated as normal GPR for now) representation

The lifter has to hide the PC register, so that ld PC shall be represented as jmp (PC+n), where n is the ARM PC offset (4 or 8 bytes IIRC).

... the way how to integrate this lifter with the old ARM lifter still remains a problem

In BAP 2.0 each program label (aka address) has its own architecture, therefore we need an analysis that will identify branches that switch the architecture.

Is the lifter responsible for linking the program labels to instructions? For example, in the bytoy lifter there's

let block seq data ctrl =
    Theory.Label.for_addr (Word.int seq) >>= fun label ->
    blk label data ctrl

which is called after every single instruction with the current PC provided.

Contributor Author

Phosphorus15 commented Jun 14, 2020

We don't really want to double the CFG of each ARM binary.

Some of this information is statically determined, though. The ARM ELF ABI docs say:

5.5.3 Symbol Values

In addition to the normal rules for symbol values the following rules shall also apply to symbols of type STT_FUNC:

  • If the symbol addresses an Arm instruction, its value is the address of the instruction (in a relocatable object, the offset of the instruction from the start of the section containing it).
  • If the symbol addresses a Thumb instruction, its value is the address of the instruction with bit zero set (in a relocatable object, the section offset with bit zero set).
    For the purposes of relocation the value used shall be the address of the instruction (st_value & ~1).

Note: This allows a linker to distinguish Arm and Thumb code symbols without having to refer to the map. An Arm symbol will always have an even value, while a Thumb symbol will always have an odd value. However, a linker should strip the discriminating bit from the value before using it for relocation.

This could be defined as ARM-only knowledge provided by the binary file (ELF etc.) loader. Still, a malicious program could switch the T flag arbitrarily, and we might not have a well-formed binary at all, so this static information is not sufficient on its own.
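A minimal sketch of the quoted rule, as a loader could apply it to the `st_value` of an `STT_FUNC` symbol. The function name `decode_func_symbol` is hypothetical, and plain OCaml integers stand in for addresses.

```ocaml
type mode = Arm | Thumb

(* Decode the st_value of an STT_FUNC symbol per the ABI rule quoted
   above: bit zero set means Thumb, and the instruction address is
   st_value with bit zero cleared (st_value & ~1). *)
let decode_func_symbol st_value =
  let mode = if st_value land 1 = 1 then Thumb else Arm in
  (mode, st_value land (lnot 1))
```

For example, a symbol value of `0x10435` decodes to a Thumb function at `0x10434`, while `0x10400` stays an ARM function at the same address.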

Member

ivg commented Jun 15, 2020

But we need a linear memory representation for instructions like LDR or PUSH whatsoever, simply model it using Theory.Mem maybe?

Yes, machine instructions are fully self-contained (unlike bytecode instructions, which sometimes need extra modeling because they are evaluated by a VM, not a CPU). Whenever you see a load or push instruction, its operands will be fully defined.

Is the lifter responsible for linking the program labels to instructions?

No, it will be linked by the IR lifter.

Member

ivg commented Jun 15, 2020

Note: This allows a linker to distinguish Arm and Thumb code symbols without having to refer to the map. An Arm symbol will always have an even value, while a Thumb symbol will always have an odd value. However, a linker should strip the discriminating bit from the value before using it for relocation.

It is only relevant to the linker and to the way symbols are encoded in the symbol table (in this particular ABI). The mode can be switched on any jump (that doesn't involve a symbol table), and both ARM and Thumb instructions can have even addresses (in fact they must have even addresses due to alignment requirements).

Contributor

@XVilka XVilka left a comment



end

(*
Contributor


You can safely remove this.

| _ -> raise (Lift_Error "`src` must be a register")
)
| _ -> raise (Lift_Error "`dest` must be a register")
(* the `R` bit is automatically resolved *)
Contributor


Please separate with a new line here and in the following code.

Contributor Author

Phosphorus15 commented Jun 16, 2020

Note: This allows a linker to distinguish Arm and Thumb code symbols without having to refer to the map. An Arm symbol will always have an even value, while a Thumb symbol will always have an odd value. However, a linker should strip the discriminating bit from the value before using it for relocation.

It is only relevant to the linker and to the way symbols are encoded in the symbol table (in this particular ABI). The mode can be switched on any jump (that doesn't involve a symbol table), and both ARM and Thumb instructions can have even addresses (in fact they must have even addresses due to alignment requirements).

But it at least provides a way to initially determine the instruction set of a symbol (under a certain ABI).

Btw, which way would you suggest to represent the PC? Obviously it should be a concrete value like Bitvec rather than an abstract Theory.Var, for the sake of addressing and labeling, but it is not correct to make the lifter itself carry a concrete state value, I guess?

@ivg ivg mentioned this pull request Jun 16, 2020
Member

ivg commented Jun 25, 2020

Btw, which way would you suggest to represent the PC? Obviously it should be a concrete value like Bitvec rather than an abstract Theory.Var, for the sake of addressing and labeling, but it is not correct to make the lifter itself carry a concrete state value, I guess?

Your guess is absolutely correct. Yes, the address of the lifted instruction is a static constant (for the target language) and is a parameter (of type Bitvec.t) for the meta language.

When you define the semantics for an instruction you build a value of type unit eff which is defined as

  type 'a eff = 'a effect knowledge

And the lifter itself is a function of type Theory.label -> unit effect knowledge, where unit effect is also known in Bap.Std as type insn, and Theory.label is known in Bap.Std as tid, or term identifier. So, in the parlance of Bap.Std, the lifter is a function of type tid -> insn knowledge that returns a knowledge computation of the instruction semantics. That means that we can use the tid = Theory.label = program obj to obtain any information about the program that is identified by this tid (including the semantics itself, so the function could be recursive). A program has a lot of properties; we can enumerate them with bap list classes -f core-theory:program, which will output something like this:

    - bap.std:common-name        a unique name associated with the program
    - bap.std:insn               a decoded machine instruction
    - bap.std:mem                a memory region occupied by the program
    - bap.std:arch               an ISA of the program
    - core-theory:semantics      the program semantics
    - core-theory:label-aliases  the set of known program names
    - core-theory:label-ivec     the program interrupt vector
    - core-theory:label-name     the program linkage name
    - core-theory:label-addr     the program virtual address
    - core-theory:is-subroutine  is the program a subroutine entry point
    - core-theory:is-valid       is the program valid or not

Our task is to provide a value for the core-theory:semantics property (which in its OCaml reflection has type unit effect). To fulfill this task we can query the knowledge base for any other property. For example, we can get bap.std:insn, which is the machine code representation (provided by the LLVM decoder), to get the decoding of the memory chunk; the memory chunk itself is also accessible through bap.std:mem. We can get the address using core-theory:label-addr if the chunk of memory has an address. The provided label will serve us as the database key, e.g.,

open Bap_core_theory
open Bap.Std
open KB.Syntax

let lifter label =
  KB.collect Theory.Label.addr label >>= fun addr -> (* the address of the current instruction, if known *)
  KB.collect Disasm_expert.Basic.Insn.slot label >>= fun insn -> (* the LLVM-provided decoding *)
  KB.collect Memory.slot label >>= fun _mem -> (* the memory chunk, probably not needed *)
  build_the_semantics_object addr insn (* placeholder for the actual semantics builder *)

Basically, you have the full access to the knowledge base in the lifter.
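The "label as a database key" idea can be mimicked without BAP at all. The following is a toy model only: the `props` record, the `lookup` function, and `toy_lifter` are hypothetical stand-ins, and `option` plays the role of the knowledge monad. It is not how the BAP knowledge base is implemented.

```ocaml
(* Toy stand-in for the properties stored under a program label. *)
type props = {
  addr : int option;    (* plays the role of core-theory:label-addr *)
  insn : string option; (* plays the role of bap.std:insn *)
}

(* Option bind stands in for the knowledge monad's bind. *)
let ( >>= ) = Option.bind

(* A "lifter" that is given only a label and asks the store for
   everything else it needs to build its result. *)
let toy_lifter (lookup : int -> props option) (label : int) : string option =
  lookup label >>= fun p ->
  p.addr >>= fun addr ->
  p.insn >>= fun insn ->
  Some (Printf.sprintf "0x%x: %s" addr insn)
```

If any property is missing, the whole computation yields `None`, which loosely mirrors a lifter that cannot provide semantics when the knowledge it depends on is absent.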

Besides, as a side note, the value of the PC register in some architectures is not equal to the address of the current instruction, sometimes it is shifted by some number of bytes (so it is pointing ahead of instructions), in arm it is 4 or 8 bytes, I don't remember. Also, llvm may mean by PC either the actual value of the PC register or the current instruction address. So keep this in mind.
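As a sketch of that shift, assuming the usual behaviour described in the ARM architecture manuals (a read of the PC yields the instruction address plus 8 in ARM state and plus 4 in Thumb state, and Thumb PC-relative literal loads use that value aligned down to 4 bytes). The names `pc_read` and `thumb_literal_base` are hypothetical, and plain integers stand in for Bitvec.t.

```ocaml
type mode = Arm | Thumb

(* Value observed when the instruction at [addr] reads the PC:
   8 bytes ahead in ARM state, 4 bytes ahead in Thumb state. *)
let pc_read mode addr =
  match mode with
  | Arm -> addr + 8
  | Thumb -> addr + 4

(* Base address of a Thumb PC-relative literal load: the PC value
   rounded down to a multiple of 4. *)
let thumb_literal_base addr = pc_read Thumb addr land (lnot 3)
```

Since the instruction address is a static parameter of the lifter, these values can be computed at lifting time and emitted as constants, which keeps the PC hidden from the lifted code. Whether LLVM already applies the offset to an operand has to be checked per instruction.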

You can also follow our discussion in the Aarch64 lifter PR (#1141), I think everything that we discuss there is applicable to this lifter as well. We may even end up with some code sharing.

And if you have any questions, please don't hesitate to ask.

Contributor Author

This brand-new Thumb lifter has been updated to match the structure of #1174, and should be fully functional on its own after proper tests.

Member

ivg commented Jul 16, 2020

okay, let's close it, but keep in mind the discussions that have happened here.

@ivg ivg closed this Jul 16, 2020

3 participants